fix: prevent premature stream closure in EventConsumer grace period#764
fix: prevent premature stream closure in EventConsumer grace period#764kabir wants to merge 5 commits intoa2aproject:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses an issue where the EventConsumer was prematurely closing streams during the grace period, before all final events were fully processed. The fix introduces a mechanism to track whether a final event is still in transit, preventing the grace period timeout from starting until all events have been delivered. Highlights
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request addresses a race condition that could cause premature stream closure in EventConsumer. The solution, which involves introducing an awaitingFinalEvent flag to delay the grace period timeout, is well-reasoned and correctly implemented in EventConsumer. However, there is a potential concurrency issue in EventQueue.ChildQueue with the new flag that should be addressed to ensure thread safety. The review includes a comment with a high-severity finding to address this potential data race, recommending the use of volatile for thread safety.
| public boolean isAwaitingFinalEvent() { | ||
| return awaitingFinalEvent; | ||
| } |
There was a problem hiding this comment.
The awaitingFinalEvent flag is accessed from multiple threads without ensuring visibility of changes. The EventConsumer thread reads the flag via this isAwaitingFinalEvent() method, while another thread writes to it when an event is enqueued. This can lead to a race condition where the consumer thread reads a stale value.
To guarantee that writes to awaitingFinalEvent are visible to other threads, it should be declared as volatile. Using AtomicBoolean with compareAndSet() is not strictly necessary since the variable is thread-confined.
For example, in ChildQueue:
private volatile boolean awaitingFinalEvent = false;This will establish a proper happens-before relationship and ensure the fix is thread-safe.
References
- When a variable is strictly thread-confined and never shared between threads,
volatileis sufficient for visibility and correctness. UsingAtomicBooleanwithcompareAndSet()is a general recommendation for managing shared mutable state across multiple threads, but it's not strictly necessary in a thread-confined context.
c10c262 to
2d7573a
Compare
2d7573a to
3d3d8d0
Compare
The EventConsumer grace period logic could close streams prematurely when final events were still in-transit through MainEventBusProcessor. This manifested as intermittent test failures where the stream would close before all events were delivered. Root Cause: - When agent execution completes, EventConsumer enters a grace period - It polls the ChildQueue with 500ms timeout, allowing 3 consecutive timeouts (1.5s total) before closing the stream - The original logic only checked queue.size() == 0 - However, final events can be in-transit: MainQueue → MainEventBus → MainEventBusProcessor → ChildQueue - This timing window (typically <500ms) could result in premature closure when the local queue was empty but the final event hadn't arrived yet Solution: - Added EventQueue.isAwaitingFinalEvent() method - MainQueue calls child.expectFinalEvent() when enqueueing final events - EventConsumer checks awaitingFinalEvent flag before starting timeout counter: agentCompleted && queueSize == 0 && !awaitingFinal - ChildQueue clears the flag only when a FINAL event is dequeued (not on any event, to avoid clearing it too early when non-final events arrive) - This ensures the grace period doesn't start counting down while a final event is still being distributed The fix handles both local execution (events available immediately) and replicated scenarios (events may arrive via Kafka with delays).
Add GitHub Actions workflow to run the intermittent tests 100 times across all transports (REST, JSON-RPC, gRPC) and JDK versions (17, 21, 25). Tests verified: - testAgentToAgentLocalHandling - testNonBlockingWithMultipleMessages - testAuthRequiredWorkflow The workflow stops on first failure and uploads surefire reports for debugging. This is a temporary workflow to validate the fix and will be removed once verified on CI.
3d3d8d0 to
714b1f1
Compare
The timeout logic was setting a local variable to false, but the next iteration would read true again from queue.isAwaitingFinalEvent(), causing the grace period to never start. Added clearAwaitingFinalEvent() method to ChildQueue to properly clear the flag on the queue itself. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The EventConsumer grace period logic could close streams prematurely when final events were still in-transit through MainEventBusProcessor. This manifested as intermittent test failures where the stream would close before all events were delivered.
Root Cause:
Solution:
The fix handles both local execution (events available immediately) and replicated scenarios (events may arrive via Kafka with delays).